Abstract. - We determine the Rényi entropies $K_q$ of symbol sequences generated by human
chromosomes. These exhibit nontrivial behaviour as a function of the scanning parameter $q$.
In the thermodynamic formalism, there are phase transition-like phenomena close to the $q = 1$
region. We develop a theoretical model for this based on the superposition of two multifractal 
sets, which can be associated with the different statistical properties of coding and non-coding 
DNA sequences. This model is in good agreement with the human chromosome data. 



DNA symbol sequences exhibit a very complicated dynamical
structure. There are long-range correlations flttTT] which are
particularly strong for the non-coding sequences (DNA sequences
which do not code for the production of proteins) whereas the
coding sequences demonstrate characteristics similar to random-like
processes [IH1]. The way in which coding and non-coding sequences
alternate in the DNA of many organisms is described by a
multifractal [T^hTo]. Various approaches have been suggested to map
DNA sequences onto the dynamics of an associated dynamical system,
such as correlated random walks [2,15], or to provide a suitable
measure representation by formally mapping DNA sequences onto
points of the unit interval [13II16|. The associated measures,
investigated in detail by Yu et al. for a large variety of
organisms [12], exhibit a non-trivial spectrum of Rényi dimensions.

In this paper we directly apply the known symbolic dynamics
techniques of the thermodynamic formalism of dynamical systems
[17-19] to DNA symbol sequences. For our data analysis we will
concentrate mostly on the human genome (chromosome 10) as a working
example. For DNA the symbol space contains 4 different symbols
A, G, T, C denoting the four nucleotides (Adenine, Guanine, Thymine
and Cytosine). Translations along the DNA string can be regarded as
a shift of (correlated) symbols. We are interested in the average
information produced by this shift, and in the set of all
higher-order correlations of the symbols. This can be measured by
various quantities which weight the rare and frequent symbol
sequences in different ways. In dynamical systems theory, for a
system with a generating partition, one defines the dynamical
Rényi entropies as

$$K_q = \lim_{N\to\infty} \frac{1}{N}\,\frac{1}{1-q}\,\ln \sum_{i_1,\ldots,i_N} p(i_1,i_2,\ldots,i_N)^q, \qquad q \neq 1 \tag{1}$$

$$K_1 = \lim_{N\to\infty} -\frac{1}{N} \sum_{i_1,\ldots,i_N} p(i_1,i_2,\ldots,i_N)\,\ln p(i_1,i_2,\ldots,i_N)$$

Here $p(i_1, i_2, \ldots, i_N)$ denotes the probability of the
symbol sequence $i_1, \ldots, i_N$. $N$ denotes the length of the
sequence and $q$ is a parameter taking real values. The above sum
is taken over all allowed symbol sequences $i_1, i_2, \ldots, i_N$,
i.e. over all sequences with $p(i_1, \ldots, i_N) \neq 0$. $K_1$ is
the Kolmogorov-Sinai entropy, a very important invariant in
dynamical systems theory. $K_0$ is the topological entropy, which
counts the growth rate of allowed symbol sequences for
$N \to \infty$. A much more complete characterisation is via the
set of all $K_q$ with $q \in (-\infty, \infty)$. These quite
generally measure the information production of the dynamical
system under consideration. From this set one can proceed to the
spectrum of dynamical crowding indices by Legendre transformation
(see e.g. [17-20] for details).
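The definition in eq. (1) can be turned into a finite-$N$ estimate by
replacing the word probabilities with relative frequencies of
overlapping length-$N$ words. The following is a minimal sketch under
that assumption; the function names and the toy sequence are
illustrative and are not taken from the analysis reported here.

```python
from collections import Counter
from math import log

def block_probabilities(seq, n):
    """Relative frequencies of all overlapping length-n words in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def renyi_entropy_estimate(seq, n, q):
    """Finite-n estimate of K_q from eq. (1); q = 1 uses the Shannon form."""
    probs = block_probabilities(seq, n)  # only observed words, so p != 0
    if abs(q - 1.0) < 1e-12:
        return -sum(p * log(p) for p in probs) / n
    return log(sum(p ** q for p in probs)) / ((1.0 - q) * n)

toy = "ACGT" * 50  # hypothetical toy sequence, not genome data
k0 = renyi_entropy_estimate(toy, 1, 0.0)
k2 = renyi_entropy_estimate(toy, 1, 2.0)
```

For this toy string all four symbols are equally frequent, so the
$N = 1$ estimates coincide with $\ln 4$ for every $q$. For real data
the estimate is only meaningful while the number of possible words,
$4^N$, stays well below the sequence length.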

For the standard Bernoulli shift of $J$ different symbols, the
symbols are statistically independent and occur with equal
probability $p = 1/J$. We thus obtain
$p(i_1, \ldots, i_N) = p^N = J^{-N}$ and $K_q = \ln J$, independent
of $q$. If there are non-trivial correlations, and non-uniform
probabilities, as is the case for DNA sequences, then the spectrum
of $K_q$ becomes nontrivial. As an example, in fig. 1 the solid
black line shows the multifractal $K_q$ spectrum obtained for the
human chromosome 10. The spectrum was numerically evaluated by
taking into account all symbol sequences up to length $N = 8$. This
length is adequate for representing the asymptotic spectrum, which
is already reached for values $N > 6$, as was also reported in
references [12,15].

Fig. 1: (Colour online) The spectrum of Rényi entropies $K_q$
for human chromosome 10 (black solid line). The dashed line
shows the corresponding spectrum of an asymmetric tent map
with 4 symbols and the same 1-point probabilities as
chromosome 10.

Our goal is to compare the information production of symbol
sequences of the human genome with those generated by simple
examples of chaotic maps. A simple example of a dynamical system
with a nontrivial $K_q$ spectrum is the asymmetric tent map
(fig. 2a), given on the unit interval $[0, 1]$ by

$$f(x) = \begin{cases} \dfrac{x}{w} & 0 \le x \le w \\[1ex] \dfrac{1-x}{1-w} & w < x \le 1 \end{cases} \tag{2}$$

The generating partition for this map corresponds to the two
intervals $I_1 = [0, w]$ and $I_2 = [w, 1]$. We may write the
symbol '1' if an iterate $x_n$ of $f$ is in $I_1$ and '2' if it is
in $I_2$. The Rényi entropies for this simple model system are
given by

$$K_q = \frac{1}{1-q}\,\ln\left(w^q + (1-w)^q\right), \qquad q \neq 1 \tag{3}$$

$$K_1 = -w \ln w - (1-w) \ln(1-w)$$
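As an illustration of how such a map produces symbols, the sketch below
iterates an asymmetric tent map and records '1' or '2' according to the
interval visited. It assumes the usual piecewise-linear form, $x/w$ on
$[0, w]$ and $(1-x)/(1-w)$ on $(w, 1]$; the peak position, the initial
value and the orbit length are arbitrary illustrative choices.

```python
def tent_map(x, w):
    """Asymmetric tent map on [0, 1] with its maximum at x = w (assumed form)."""
    return x / w if x <= w else (1.0 - x) / (1.0 - w)

def symbol_sequence(x0, w, n):
    """Symbols '1' (x in I_1 = [0, w]) and '2' (x in I_2) along an orbit."""
    symbols, x = [], x0
    for _ in range(n):
        symbols.append("1" if x <= w else "2")
        x = tent_map(x, w)
    return "".join(symbols)

w = 0.3                                   # hypothetical peak position
seq = symbol_sequence(0.1234, w, 20000)   # arbitrary initial value
freq_1 = seq.count("1") / len(seq)        # should be close to w
```

Because the invariant density of this map is uniform on $[0, 1]$, the
symbol '1' occurs with probability $w$, which is why only $w$ and
$1 - w$ enter the entropy formula.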

The above chaotic dynamical system generates symbol sequences
consisting of just two different symbols. An obvious generalisation
is to $J$ different symbols, where the corresponding piecewise
linear map has $J/2$ maxima (fig. 2b). In this case the $K_q$ are
given by

$$K_q = \frac{1}{1-q}\,\ln\left(w_1^q + w_2^q + \cdots + w_J^q\right), \qquad q \neq 1 \tag{4}$$

$$K_1 = -\sum_{i=1}^{J} w_i \ln w_i$$

with $w_1 + w_2 + \cdots + w_J = 1$. The parameters $w_j$ correspond
to the 1-point probabilities of the occurrences of the symbols $j$.

Fig. 2: (Colour online) Example of an asymmetric tent map
with (a) 1 maximum (shift of 2 symbols) and (b) 2 maxima
(shift of 4 symbols).

For human Chromosome 10, the observed values of 1-point symbol
probabilities are $w_1 = w_A = 0.291921$, $w_2 = w_C = 0.207966$,
$w_3 = w_G = 0.207859$ and $w_4 = w_T = 0.292219$ [15]. The
entropies $K_q$ of the human genome can neither be fitted by the
above simple model with $J = 2$, which in the multifractal language
corresponds to a two-scale Cantor set with a multiplicative
measure, nor using $J = 4$, which corresponds to a 4-scale Cantor
set, choosing the same 1-point probabilities as observed. This is
shown in fig. 1. The $q$-dependence of the chromosome data is much
more pronounced than that of the corresponding asymmetric chaotic
map that shifts 4 symbols. We thus need a more sophisticated
approach to reproduce the observed multifractal information
production of the human genome.
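To see how weakly the $J = 4$ model depends on $q$, eq. (4) can be
evaluated directly for the one-point probabilities quoted above. This
is a minimal sketch; the function name, the chosen $q$ values and the
renormalisation step (the quoted values sum to 0.999965) are
assumptions made for illustration.

```python
from math import log

def k_q_piecewise_linear(weights, q):
    """Renyi entropy of eq. (4) for one-point probabilities `weights`."""
    if abs(q - 1.0) < 1e-12:
        return -sum(w * log(w) for w in weights)
    return log(sum(w ** q for w in weights)) / (1.0 - q)

# One-point probabilities quoted in the text for chromosome 10 (A, C, G, T).
w_chr10 = [0.291921, 0.207966, 0.207859, 0.292219]
total = sum(w_chr10)
w_chr10 = [w / total for w in w_chr10]  # assumption: renormalise to sum 1

spectrum = {q: k_q_piecewise_linear(w_chr10, q) for q in (-20, -5, 0, 1, 2, 5, 20)}
```

For any $q$ the result lies between $-\ln \max_j w_j \approx 1.23$ and
$-\ln \min_j w_j \approx 1.57$, so this model can only produce a
narrow band of values around $K_0 = \ln 4$.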

The idea developed in the sequel is to take into account the
different dynamical properties of the coding and non-coding strings
which constitute the chromosomes. The symbol sequence probabilities
are, in general, different for each of those regions, and are
denoted by $p^{(c)}(i_1, \ldots, i_N)$ and
$p^{(nc)}(i_1, \ldots, i_N)$, respectively. In the following,
inspired by the multifractal formalism, we consider sequences of
size $N$ as part of longer sequences and we write
$N = -\log \varepsilon$, where $\varepsilon$ is the partition
'box size'. The limit $N \to \infty$ corresponds to 'box size'
$\varepsilon \to 0$, and the $K_q$ are then identical (up to a
multiplicative factor) to the Rényi dimensions $D_q$ of a
multifractal that encodes the dynamical properties.

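When the dynamical partition function

$$Z(q) := \sum_{i_1,\ldots,i_N} p(i_1,\ldots,i_N)^q \sim \varepsilon^{(q-1)K_q} \tag{5}$$

is evaluated, there are contributions from both types of
strings. We thus have

$$Z(q) \sim N_c\,\varepsilon^{(q-1)K_q^{(c)}} + N_{nc}\,\varepsilon^{(q-1)K_q^{(nc)}}, \tag{6}$$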

where the numbers $N_c$, $N_{nc}$ determine how many strings are in
the coding and non-coding region, respectively. If $N_c$, $N_{nc}$
are independent of $\varepsilon$, then the Rényi entropies of the
entire system are given by the term that dominates the partition
function for $\varepsilon \to 0$, i.e.

$$K_q = \begin{cases} \min\left(K_q^{(c)}, K_q^{(nc)}\right) & \text{for } q > 1 \\[1ex] \max\left(K_q^{(c)}, K_q^{(nc)}\right) & \text{for } q < 1. \end{cases} \tag{7}$$

In the thermodynamic formalism of dynamical systems, this means
that the free energy $(q-1)K_q$ exhibits a phase transition
(non-analytic behaviour) at the critical value
$q_{\mathrm{critical}} = 1$ (see also [19] for other systems
exhibiting phase transitions in the Rényi entropies). Clearly such
a behaviour can only be seen if one uses entropy measures other
than the usual KS entropy (corresponding to $q = 1$) for the
investigation of the information production of the human genome.
This once again illustrates the importance of studying the entire
multifractal spectrum $K_q$.

The above simple phase transition model of $K_q$ agrees well with
the genome data, see fig. 3. Figure 3a shows two approximations of
the human chromosome data via two different multifractal sets. For
the modelling, multifractal sets with $J = 4$ different symbols
were taken into account, since the genome consists of 4
nucleotides. For simplicity only one effective scale $w_1$ was
introduced into each of the two sets, leading to

$$K_q = \frac{1}{1-q}\,\ln\left(w_1^q + 3 w_2^q\right), \qquad q \neq 1 \tag{8}$$

$$K_1 = -\left(w_1 \ln w_1 + 3 w_2 \ln w_2\right)$$

where $w_2 = (1 - w_1)/3$. The first one approximates well the
chromosome 10 data when $q \to \infty$ with $w_1 = 0.447$ but fails
in the region $q \to -\infty$, see fig. 3a (red circles). The
second multifractal set approximates the data in the opposite
region, with $w_1 = 0.126$, see fig. 3a (blue squares). In fig. 3b
the red-dashed line is a composite of the two multifractal sets,
based on forming the maximum, respectively the minimum, according
to eq. (7). This approximates the data well in the entire
$q$-region. In fig. 3 the values of the limit entropies
$K_{\pm\infty}$ were fitted to give the best coincidence with the
data. Note that the region $q \to -\infty$ is dominated by very
rare symbol sequences and the region $q \to +\infty$ by the most
frequent ones. Also, it should be clear that finite size effects in
the genomic data make a sharp phase transition unobservable since,
as in our numerical analysis, only symbol sequences of finite size
are investigated. Our hypothesis in the following is to associate
the blue curve (squares) in fig. 3 with the non-coding sequences
and the red curve (circles) with the coding ones.
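The composite spectrum described above can be sketched numerically by
combining the one-scale formula of eq. (8) with the min/max rule of
eq. (7). The function names are illustrative, and the value returned
exactly at $q = 1$ (where eq. (7) is not defined) is an arbitrary
convention of this sketch.

```python
from math import log

def k_q_one_scale(w1, q):
    """Eq. (8): four symbols with weights w1, w2, w2, w2 and w2 = (1 - w1) / 3."""
    w2 = (1.0 - w1) / 3.0
    if abs(q - 1.0) < 1e-12:
        return -(w1 * log(w1) + 3.0 * w2 * log(w2))
    return log(w1 ** q + 3.0 * w2 ** q) / (1.0 - q)

def k_q_composite(q, w1_a=0.447, w1_b=0.126):
    """Eq. (7): minimum of the two spectra for q > 1, maximum otherwise."""
    ka, kb = k_q_one_scale(w1_a, q), k_q_one_scale(w1_b, q)
    return min(ka, kb) if q > 1.0 else max(ka, kb)

composite = {q: k_q_composite(q) for q in (-10, -5, 0, 2, 5, 10)}
```

With the two fitted weights, the rule selects the $w_1 = 0.447$ set at
large positive $q$ and the $w_1 = 0.126$ set at large negative $q$, in
line with the regions where each set is reported to fit the data.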

Fig. 3: (Colour online) a) Separate approximations to the
$K_q$ spectrum for $q \to -\infty$ (blue squares) and
$q \to +\infty$ (red circles). b) Composite multifractal spectrum
(red triangles) and multifractal spectrum of Chromosome 10,
organism Homo Sapiens (solid black line).

In the thermodynamic formalism of dynamical systems, the role of
the free energy is played by the function $r_q = (q-1)K_q$ rather
than $K_q$ itself [17]. It is therefore useful to analyze this
function in somewhat more detail. $r_q$ is shown in fig. 4, with
the solid black line representing the human chromosome 10 and the
red triangles originating from the composite model. Again we see
evidence for the presence of a critical value
$q_{\mathrm{critical}}$ with phase-transition-like behaviour. An
abrupt change of slope is clearly observable in the area
< q < 4, though of course the precise critical $q$-value cannot be
located due to finite size effects. Our model predicts that $r_q$
is a continuous but non-differentiable function of $q$ at
$q_{\mathrm{critical}} = 1$, which in the thermodynamic analogy
corresponds to a first-order phase transition. The relevant
transition area is designated in fig. 4 by two perpendicular
dashed lines.
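The predicted kink can be made explicit for the two one-scale sets with
$w_1 = 0.447$ and $w_1 = 0.126$. Multiplying the min/max rule of
eq. (7) by $(q-1)$ gives the smaller of the two branch free energies
for every $q \neq 1$, and for a single branch the slope of $(q-1)K_q$
at $q = 1$ equals $K_1$. The sketch below estimates the one-sided
slopes by finite differences; the step size and function names are
arbitrary choices.

```python
from math import log

def free_energy_branch(w1, q):
    """r_q = (q - 1) K_q for eq. (8), which simplifies to -ln(w1^q + 3 w2^q)."""
    w2 = (1.0 - w1) / 3.0
    return -log(w1 ** q + 3.0 * w2 ** q)

def free_energy_composite(q, w1_a=0.447, w1_b=0.126):
    """Composite r_q: (q - 1) times eq. (7) is the smaller branch value."""
    return min(free_energy_branch(w1_a, q), free_energy_branch(w1_b, q))

h = 1e-6  # finite-difference step, an arbitrary choice
slope_left = (free_energy_composite(1.0) - free_energy_composite(1.0 - h)) / h
slope_right = (free_energy_composite(1.0 + h) - free_energy_composite(1.0)) / h
jump = slope_left - slope_right  # positive: the slope drops across q = 1
```

The left slope approaches the larger and the right slope the smaller of
the two $K_1$ values, so the composite free energy is continuous at
$q = 1$ while its first derivative jumps.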

Fig. 4: (Colour online) The $r_q$ of Chromosome 10, organism
Homo Sapiens (solid black line) and of the composite
multifractal spectrum (red triangles).

So far our composite multifractal model shows a phase transition at
$q = 1$, since by construction the two Cantor sets were joined at
the $q = 1$ scale, see eq. (7). On the other hand, it is known that
the numbers $N_c$ and $N_{nc}$ can depend on $\varepsilon$ in a
significant way. Long-range correlations are demonstrated in the
noncoding sequences, while short-range ones are displayed by the
coding sequences [r HBITTS]. The structure of (mostly) noncoding
sequences, as interwoven with coding sequences, forms a
(multi-)fractal as well. This means the above numbers $N_c$ and
$N_{nc}$ scale with $\varepsilon$ and thus the critical value
$q_{\mathrm{critical}}$ can shift to different values. This is
clearly observed in the present data, both in the $K_q$ spectrum
(see fig. 3) and in the $r_q$ one (see fig. 4). Fig. 4 indicates
that the critical point is slightly displaced to a value
$q_{\mathrm{critical}} \approx 2 > 1$. As we shall see below, this
behaviour can be understood from the domination of the long-range
correlated noncoding sequences, $N_{nc} \gg N_c$, which are known
to cover approximately 97% of the human genome.

Mathematically, if we assume that the coding sequences scale as

$$N_c \sim \varepsilon^{-d_c} \tag{9}$$

and the non-coding ones as

$$N_{nc} \sim \varepsilon^{-d_{nc}}, \tag{10}$$

then the critical point $q_{\mathrm{critical}}$ is determined by
the relative dominance of the two exponents in eq. (6), i.e. by the
condition

$$(q_{\mathrm{critical}} - 1)K_q^{(c)} - d_c = (q_{\mathrm{critical}} - 1)K_q^{(nc)} - d_{nc}, \tag{11}$$

which, depending on the numbers $d_c$ and $d_{nc}$, can shift the
critical value away from 1. Solving for $q_{\mathrm{critical}}$ we
obtain

$$q_{\mathrm{critical}} = 1 + \frac{d_{nc} - d_c}{K_q^{(nc)} - K_q^{(c)}}. \tag{12}$$
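As a quick consistency check of the step from eq. (11) to eq. (12), the
sketch below computes the critical value from given exponents and
verifies that both sides of the balance condition agree. The entropies
are treated as constants near the crossing, which is a simplification,
and all numbers are hypothetical illustrative values, not fits to the
chromosome data.

```python
def critical_q(k_c, k_nc, d_c, d_nc):
    """Eq. (12): q_critical = 1 + (d_nc - d_c) / (k_nc - k_c)."""
    if k_nc == k_c:
        raise ValueError("eq. (12) needs K^(nc) != K^(c)")
    return 1.0 + (d_nc - d_c) / (k_nc - k_c)

# Hypothetical illustrative numbers only; none are taken from the data.
k_c, k_nc, d_c, d_nc = 1.20, 1.30, 0.50, 0.60
q_star = critical_q(k_c, k_nc, d_c, d_nc)

# Both sides of the balance condition, eq. (11), evaluated at q_star.
lhs = (q_star - 1.0) * k_c - d_c
rhs = (q_star - 1.0) * k_nc - d_nc
```

With equal scaling exponents, $d_{nc} = d_c$, the function returns
$q_{\mathrm{critical}} = 1$, recovering the case where $N_c$ and
$N_{nc}$ do not depend on $\varepsilon$.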



At $q \approx 2$ we see from fig. 3 that $K_q^{(nc)}$ (blue
squares) is bigger than $K_q^{(c)}$ (red circles). Hence eq. (12)
implies that $d_{nc} > d_c$. This, on the other hand, implies

$$N_{nc} \sim \varepsilon^{-d_{nc}} \gg N_c \sim \varepsilon^{-d_c}, \tag{13}$$

consistent with the fact that the number $N_{nc}$ of non-coding
sequences dominates over the number $N_c$ of coding ones.

To conclude, we have shown that the information production of the
human genome, if regarded as a shift of the four symbols
A, C, G, T, is very complex and can only be fully understood by
considering the entire spectrum of Rényi entropies $K_q$. The
multifractal structure can be approximated to a great extent by a
superposition of two processes, one describing the system for
$q > q_{\mathrm{critical}}$ and one for
$q < q_{\mathrm{critical}}$, corresponding roughly to coding and
non-coding DNA characteristics.